home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
The PC-SIG Library 10
/
The PC-Sig Library - Shareware for the IBM PC and Compatibles (PC-SIG)(Tenth Edition Disks 1-2804)(1991).iso
/
PC_SIGCD
/
07
/
3
/
DISK0731.ZIP
/
INDEXREF
< prev
next >
Wrap
Text File
|
1987-02-09
|
6KB
|
123 lines
REFERENCE INDEX(2)
INDEX generates what is technically called an "inverted index".
Whereas, the disk itself is organized as "files containing words",
the index is organized as "words containing files". While the
notion is simple, the implementation is faced with many issues.
The largest issue is one of space. A simple list of words with
each file name containing the word and a sub list of file names
would be many times the size of the data to be indexed. The
first economy of space is to build a map of file names to file
numbers. Thus a two byte file number can stand for each instance
of a file name in the index.
File names are technically restricted to 64 characters. In
practice, they are much shorter. Still the use of hierarchical
directories requires a list of directory names to describe the
path to a file. INDEX builds a "map" which links file names to
directory names in the path.
To attack the space problem further, the list of words itself
must be attacked. First, INDEX, deals only with letters. No
digits or punctuation are indexed. Second, case is ignored. All
lower case letters are converted to upper case. Third, words
less than 3 characters in length are discarded and words greater
than 7 characters in length are significant only in the first 7
characters. Fourth, each word is discarded if it appears in the
list of "common" words. Each of these methods gives up something
in the retrieval phase to reduce the size of the index.
Still, the list of unique words indexed is large. Acronyms and
proper names added to "normal" words add up to perhaps hundreds
of thousands of possible words. The memory size of most personal
computers limits the number of words in memory to tens of
thousands words at best.
The second issue is one of speed. Unless indexing were performed
at an operating system level, an index can only represent the
state of the disk at one particular time. Thus it becomes
necessary to re-index data periodically. While microcomputer
systems are idle substantial periods of time, it is desirable to
index quickly. Retrieval requires that the list of files
associated with a key word be accessible quickly as well.
INDEX thus makes the assumption that there are only 4093 unique
words. To reduce the set of unique words to 4093, a mathematical
operation (called hashing) is applied to each word generating a
number which is then divided by 4093, the remainder standing for
the word. Further, even though a file may contain many instances
of keywords generating a particular number, only one such
instance needs to be recorded. This greatly reduces the space
and time required to generate an inverted index. The problem
created by this assumption is that two different keywords may
generate the same number. Fortunately, statistics are on our
side and when two or more keywords are used for a retrieval, the
probability of two files containing keywords generating the same
numbers is small.
Consequently, INDEX generates a file containing one entry for
each unique keyword number which in turn points to a file
containing one file number for each unique keyword number, file
number pair.
The following is a general description of the program flow.
INDEX performs initialization, scans the current directory, and
then wraps up.
Initialization consists of writing the copyright notice and
version, processing the parameters, initializing global
variables, reading the EXTENSIO.INX file, reading the WORD.INX
file and finally initializing output files (MAP.INX, ENTRY.INX,
and KEY.INX).
To scan the current directory, the MS-DOS function "find first
file" is issued. If that is successful, the file name is parsed
and the file is analyzed. While successive calls to "find next
file" are successful, each file is analyzed.
To analyze a file, the directory is checked to determine if the
file is a directory or a file. If it is a directory and is not
"." or ".." then a "change directory" is performed and the new
directory is scanned recursively. The nesting level is
incremented each time a sub-directory is encountered and
decremented as each directory scan is completed. Each directory
scanned generates an entry into the MAP.INX file.
If a file is not a directory (i.e. is an ordinary file) and its
extension is not found in the list of excludable extensions, it
is added to MAP.INX pointing back to the parent directory. The
file name is displayed on the console and then the file itself is
scanned.
To scan a file, a new ordered lists of codes is created and
initialized (unless this is a second or subsequent pass in which
case a list is reused). After the file is opened, it is scanned
as blocks of characters. Any type of file (i.e. code, data,
text, etc...) is treated the same way. Lower case letters are
converted to upper case. File scanning has two states; either it
is assembling a word or it is not. If it is not assembling
words, characters are ignored until a letter is encountered in
which case it switches state. If it is assembling words, each
letter encountered effects a mathematical operation which
generates a number.
Finding a non letter terminates assembly of the word. The word
is searched in the common word list. If found, the word is
discarded and the scan continues. If not found the number is
folded onto the range of distinct words and the code for the word
is entered into a sorted list. Scanning continues until the end
of file is reached. Upon reaching the end of a file, a lack of
memory will trigger writing or updating the index file.
Wrapup consists of writing an index, closing the files, and
producing a brief report. Writing the index requires inverting
the collected data which was collected and file order but needs
to be retrieved in keyword code order. For each possible keyword
code the list by file of keyword codes is scanned. Each file
containing the current keyword code is written to KEY.INX. At
the end of each keyword code the current position in KEY.INX is
written to ENTRY.INX.